A Note on Topical N-grams
Abstract
Most popular topic models, such as Latent Dirichlet Allocation, rest on the bag-of-words assumption. Text, however, is a sequence of discrete word tokens, and without considering word order (in other words, the nearby context in which a word occurs), word co-occurrences alone cannot capture the exact meaning of language. Collocations of words (phrases) therefore have to be considered. Like individual words, however, phrases are sometimes polysemous, depending on the context. More noticeably, a composition of two or more words is a phrase in some contexts but not in others. In this paper, we propose a new probabilistic generative model that automatically identifies unigram words and phrases based on context and simultaneously associates them with a mixture of topics, and we show interesting results on large text corpora.
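The idea that the same word pair can form a phrase under one topic and not under another can be illustrated with a toy sampler. This is only a sketch under stated assumptions, not the model specification from the paper: the function names, the uniform topic choice, the per-word continuation probability `p_phrase`, and all vocabulary and probability values are hypothetical and chosen purely for illustration.

```python
import random

# Hypothetical toy parameters (not taken from the paper).
TOPICS = ["politics", "home"]
UNIGRAM = {  # per-topic word distributions
    "politics": {"white": 0.3, "house": 0.3, "senate": 0.4},
    "home": {"white": 0.3, "house": 0.3, "paint": 0.4},
}
# Per-(topic, previous word) distributions used when a phrase continues.
# "white" can be continued as a phrase only under the "politics" topic.
BIGRAM = {("politics", "white"): {"house": 1.0}}
# Assumed probability that a phrase continues after a given previous word.
P_PHRASE = {"white": 0.9}


def generate_document(length, topics, unigram, bigram, p_phrase, rng):
    """Sample `length` tokens as (word, topic, is_continuation) triples."""
    tokens = []
    prev = None
    for _ in range(length):
        topic = rng.choice(topics)  # simplification: uniform topic mixture
        # A continuation is possible only if this (topic, prev) pair has a table.
        cont_table = bigram.get((topic, prev)) if prev is not None else None
        is_cont = cont_table is not None and rng.random() < p_phrase.get(prev, 0.0)
        table = cont_table if is_cont else unigram[topic]
        words = list(table)
        weights = [table[w] for w in words]
        word = rng.choices(words, weights=weights, k=1)[0]
        tokens.append((word, topic, is_cont))
        prev = word
    return tokens


def merge_phrases(tokens):
    """Join each continuation token onto the unit that precedes it."""
    units = []
    for word, _topic, is_cont in tokens:
        if is_cont and units:
            units[-1] = units[-1] + " " + word
        else:
            units.append(word)
    return units


if __name__ == "__main__":
    rng = random.Random(0)
    doc = generate_document(12, TOPICS, UNIGRAM, BIGRAM, P_PHRASE, rng)
    print(merge_phrases(doc))
```

In this sketch "white house" can be emitted as a single unit only when the second token is drawn under the hypothetical "politics" topic; under "home" the two words are always separate unigrams. Inference, which a real topic model would need in order to recover these assignments from data, is not shown.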
Related resources
N-gramas sintácticos no-continuos
In this paper, we present the concept of noncontinuous syntactic n-grams. In our previous works we introduced the general concept of syntactic n-grams, i.e., n-grams that are constructed by following paths in syntactic trees. Their great advantage is that they allow purely linguistic (syntactic) information to be introduced into machine learning methods. A certain disadvantage is that previous ...
Comparison of the healing effects of topical Phenytoin, Estrogen and Silver Sulfadiazine on skin wounds in male rats
Background and aim: Shortening the duration of wound healing has been an attractive subject for investigators in recent years. In this study, the effects of topical phenytoin, silver sulfadiazine, estrogen and their combination on wound healing were evaluated. Materials and Methods: This experimental study was accomplished on 30 male albino rats, which had an approximate wei...
Modeling Harmony with Skip-Grams
String-based (or viewpoint) models of tonal harmony often struggle with data sparsity in pattern discovery and prediction tasks, particularly when modeling composite events like triads and seventh chords, since the number of distinct n-note combinations in polyphonic textures is potentially enormous. To address this problem, this study examines the efficacy of skip-grams in music research, an a...
n-Grams: Language-Independent Categorization of Text
A language-independent means of gauging topical similarity in unrestricted text is described. The method combines information derived from n-grams (consecutive sequences of n characters) with a simple vector-space technique that makes sorting, categorization, and retrieval feasible in a large multilingual collection of documents. No prior information about document content or language is requir...
A New Method of N-gram Statistics for Large Number of n and Automatic Extraction of Words and Phrases from Large Text Data of Japanese
In the process of establishing information theory, C. E. Shannon proposed the Markov process as a good model to characterize a natural language. The core of this idea is to calculate the frequencies of strings composed of n characters (n-grams), but this statistical analysis of large text data for a large n has never been carried out because of the memory limitations of computers and the ...